{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "08_Playing_with_word_embedding_colab.ipynb",
      "version": "0.3.2",
      "provenance": [],
      "toc_visible": true,
      "include_colab_link": true
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.8"
    },
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/mlelarge/dataflowr/blob/master/CEA_EDF_INRIA/08_Playing_with_word_embedding_colab.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "collapsed": true,
        "id": "8TOhzssTBZAb",
        "colab_type": "text"
      },
      "source": [
        "# Finding Synonyms and Analogies\n",
        "\n",
        "This notebook is taken from a [PyTorch NLP tutorial](https://github.com/joosthub/pytorch-nlp-tutorial-ny2018/blob/master/day_1/0_Using_Pretrained_Embeddings.ipynb) source: [repository for the training tutorial as the 2018 O'Reilly AI Conference in NYC on April 29 and 30, 2018](https://github.com/joosthub/pytorch-nlp-tutorial-ny2018)\n",
        "\n",
        "Another related sources [chapter 10.6](http://www.d2l.ai/chapter_natural-language-processing/similarity-analogy.html) of [Dive into Deep Learning](http://www.d2l.ai/index.html) and [chapter 10.5](http://www.d2l.ai/chapter_natural-language-processing/glove.html) explaining Word Embedding with Global Vectors (GloVe)"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "AIoJq1YtB0f5",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "!pip install annoy"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "HhJ0hUC5BZAe",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "from annoy import AnnoyIndex\n",
        "import numpy as np\n",
        "import torch\n",
        "from tqdm import tqdm_notebook\n",
        "from argparse import Namespace"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "dhHjwm2JBZAl",
        "colab_type": "text"
      },
      "source": [
        "Glove embeddings can be downloaded from [GloVe webpage](https://nlp.stanford.edu/projects/glove/).\n",
        "\n",
        "To get the dataset, you can also add this drive folder to your own google drive account : https://drive.google.com/open?id=1Ht_snd9YSON7d1wv8whWwqmfqdXmjOUJ"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "f76c_EM1Ch3q",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "## SETUP\n",
        "from google.colab import drive\n",
        "drive.mount('/content/drive')"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "NWUdh2BxBZAn",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "args = Namespace(\n",
        "    glove_filename='glove.6B.100d.txt'\n",
        ")"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "rlPQwfDzD9lf",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "%cd /content/drive/My\\ Drive/CEA_EDF_INRIA_NLP/data/\n",
        "%ls"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "4GX0hecoBZAr",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "def load_word_vectors(filename):\n",
        "    word_to_index = {}\n",
        "    word_vectors = []\n",
        "    \n",
        "    with open(filename) as fp:\n",
        "        for line in tqdm_notebook(fp.readlines(), leave=False):\n",
        "            line = line.split(\" \")\n",
        "            \n",
        "            word = line[0]\n",
        "            word_to_index[word] = len(word_to_index)\n",
        "            \n",
        "            vec = np.array([float(x) for x in line[1:]])\n",
        "            word_vectors.append(vec)\n",
        "            \n",
        "    return word_to_index, word_vectors"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "lDUMIr66BZAz",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "word_to_index, word_vectors = load_word_vectors(args.glove_filename)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "jvIOgxSRBZA8",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "len(word_vectors)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "3iaYd15JBZBB",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "word_vectors[0].shape"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "IlkJErGvBZBG",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "word_to_index['beautiful']"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Y3K8WuWEBZBL",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "class PreTrainedEmbeddings(object):\n",
        "    def __init__(self, glove_filename):\n",
        "        self.word_to_index, self.word_vectors = load_word_vectors(glove_filename)\n",
        "        self.word_vector_size = len(self.word_vectors[0])\n",
        "        \n",
        "        self.index_to_word = {v: k for k, v in self.word_to_index.items()}\n",
        "        self.index = AnnoyIndex(self.word_vector_size, metric='euclidean')\n",
        "        print('Building Index')\n",
        "        for _, i in tqdm_notebook(self.word_to_index.items(), leave=False):\n",
        "            self.index.add_item(i, self.word_vectors[i])\n",
        "        self.index.build(50)\n",
        "        print('Finished!')\n",
        "    \n",
        "    def get_embedding(self, word):\n",
        "        return self.word_vectors[self.word_to_index[word]]\n",
        "    \n",
        "    def closest(self, word, n=1):\n",
        "        vector = self.get_embedding(word)\n",
        "        nn_indices = self.index.get_nns_by_vector(vector, n)\n",
        "        return [self.index_to_word[neighbor] for neighbor in nn_indices]\n",
        "    \n",
        "    def closest_v(self, vector, n=1):\n",
        "        nn_indices = self.index.get_nns_by_vector(vector, n)\n",
        "        return [self.index_to_word[neighbor] for neighbor in nn_indices]\n",
        "    \n",
        "    def sim(self, w1, w2):\n",
        "        return np.dot(self.get_embedding(w1), self.get_embedding(w2))\n"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "bZzQW7pyBZBP",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "glove = PreTrainedEmbeddings(args.glove_filename)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Stpp4FNaBZBT",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "glove.closest('apple', n=5)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "ARYYKOwcBZBa",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "glove.closest('chip', n=5)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "e0-ywu08BZBe",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "glove.closest('baby', n=5)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "AtS8hOuPBZBm",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "glove.closest('beautiful', n=5)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "jhPAEA5lBZBs",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "def SAT_analogy(w1, w2, w3):\n",
        "    '''\n",
        "    Solves problems of the type:\n",
        "    w1 : w2 :: w3 : __\n",
        "    '''\n",
        "    closest_words = []\n",
        "    try:\n",
        "        w1v = glove.get_embedding(w1)\n",
        "        w2v = glove.get_embedding(w2)\n",
        "        w3v = glove.get_embedding(w3)\n",
        "        w4v = w3v + (w2v - w1v)\n",
        "        closest_words = glove.closest_v(w4v, n=5)\n",
        "        closest_words = [w for w in closest_words if w not in [w1, w2, w3]]\n",
        "    except:\n",
        "        pass\n",
        "    if len(closest_words) == 0:\n",
        "        print(':-(')\n",
        "    else:\n",
        "        print('{} : {} :: {} : {}'.format(w1, w2, w3, closest_words[0]))"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "rlxpUPIhBZBw",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "SAT_analogy('man', 'he', 'woman')"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "07UCvv3fBZB4",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "SAT_analogy('fly', 'plane', 'sail')"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "gOHAs9gIBZB9",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "SAT_analogy('beijing', 'china', 'tokyo')"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "VdpkkLgNBZCC",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "SAT_analogy('man', 'woman', 'son')"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "M019lyRbBZCH",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "SAT_analogy('man', 'leader', 'woman')"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Dmuw3B3zBZCL",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "SAT_analogy('woman', 'leader', 'man')"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "qeWK5ASVBZCP",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        ""
      ],
      "execution_count": 0,
      "outputs": []
    }
  ]
}